Search for: All records

Editors contains: "Kelso, Janet"

  1. Kelso, Janet (Ed.)
    Abstract Motivation Displaying proportional data across many spatially resolved coordinates is a challenging but important data visualization task, particularly for spatially resolved transcriptomics data. Scatter pie plots are one type of commonly used data visualization for such data but present perceptual challenges that may lead to difficulties in interpretation. Increasing the visual saliency of such data visualizations can help viewers more accurately identify proportional trends and compare proportional differences across spatial locations. Results We developed scatterbar, an open-source R package that extends ggplot2, to visualize proportional data across many spatially resolved coordinates using scatter stacked bar plots. We apply scatterbar to visualize deconvolved cell-type proportions from a spatial transcriptomics dataset of the adult mouse brain to demonstrate how scatter stacked bar plots can enhance the distinguishability of proportional distributions compared to scatter pie plots. Availability and implementation scatterbar is available on CRAN at https://cran.r-project.org/package=scatterbar, with additional documentation and tutorials at https://jef.works/scatterbar/.
  2. Kelso, Janet (Ed.)
    Abstract Motivation Recent advancements in natural language processing have highlighted the effectiveness of global contextualized representations from protein language models (pLMs) in numerous downstream tasks. Nonetheless, strategies to encode the site-of-interest leveraging pLMs for per-residue prediction tasks, such as crotonylation (Kcr) prediction, remain largely uncharted. Results Herein, we adopt a range of approaches for utilizing pLMs by experimenting with different input sequence types (full-length protein sequence versus window sequence), assessing the implications of utilizing per-residue embedding of the site-of-interest as well as embeddings of window residues centered around it. Building upon these insights, we developed a novel residual ConvBiLSTM network designed to process window-level embeddings of the site-of-interest generated by the ProtT5-XL-UniRef50 pLM using full-length sequences as input. This model, termed T5ResConvBiLSTM, surpasses existing state-of-the-art Kcr predictors in performance across three diverse datasets. To validate our approach of utilizing full sequence-based window-level embeddings, we also delved into the interpretability of ProtT5-derived embedding tensors in two ways: firstly, by scrutinizing the attention weights obtained from the transformer’s encoder block; and secondly, by computing SHAP values for these tensors, providing a model-agnostic interpretation of the prediction results. Additionally, we enhance the latent representation of ProtT5 by incorporating two additional local representations, one derived from amino acid properties and the other from a supervised embedding layer, through an intermediate fusion stacked generalization approach, using an n-mer window sequence (or peptide/fragment). The resultant stacked model, dubbed LMCrot, exhibits a more pronounced improvement in predictive performance across the tested datasets. Availability and implementation LMCrot is publicly available at https://github.com/KCLabMTU/LMCrot.
  3. Kelso, Janet (Ed.)
    Abstract Motivation DNA-based data storage is a quickly growing field that hopes to harness the massive theoretical information density of DNA molecules to produce a competitive next-generation storage medium suitable for archival data. In recent years, many DNA-based storage system designs have been proposed. Given that no common infrastructure exists for simulating these storage systems, comparing many different designs along with many different error models is increasingly difficult. To address this challenge, we introduce FrameD, a simulation infrastructure for DNA storage systems that leverages the underlying modularity of DNA storage system designs to provide a framework to express different designs while being able to reuse common components. Results We demonstrate the utility of FrameD and the need for a common simulation platform using a case study. Our case study compares designs that utilize strand copies differently, some that align strand copies using Multiple Sequence Alignment (MSA) algorithms and others that do not. We found that the choice to include MSA in the pipeline is dependent on the error rate and the type of errors being injected and is not always beneficial. In addition to supporting a wide range of designs, FrameD provides the user with transparent parallelism to deal with a large number of reads from sequencing and the need for many fault injection iterations. We believe that FrameD fills a void in the tools publicly available to the DNA storage community by providing a modular and extensible framework with support for massive parallelism. As a result, it will help accelerate the design process of future DNA-based storage systems. Availability and implementation The source code for FrameD, along with the data generated during the demonstration of FrameD, is available in a public GitHub repository at https://github.com/dna-storage/framed (DOI: 10.5281/zenodo.7757762).
  4. Kelso, Janet (Ed.)
    Abstract Motivation Driven by technological advances, the throughput and cost of mass spectrometry (MS) proteomics experiments have improved by orders of magnitude in recent decades. Spectral library searching is a common approach to annotating experimental mass spectra by matching them against large libraries of reference spectra corresponding to known peptides. An important disadvantage, however, is that only peptides included in the spectral library can be found, whereas novel peptides, such as those with unexpected post-translational modifications (PTMs), will remain unknown. Open modification searching (OMS) is an increasingly popular approach to annotate modified peptides based on partial matches against their unmodified counterparts. Unfortunately, this leads to very large search spaces and excessive runtimes, which is especially problematic considering the continuously increasing sizes of MS proteomics datasets. Results We propose an OMS algorithm, called HOMS-TC, that fully exploits parallelism in the entire pipeline of spectral library searching. We designed a new highly parallel encoding method based on the principle of hyperdimensional computing to encode mass spectral data to hypervectors while minimizing information loss. This process can be easily parallelized since each dimension is calculated independently. HOMS-TC processes two stages of existing cascade search in parallel and selects the most similar spectra while considering PTMs. We accelerate HOMS-TC on NVIDIA’s tensor core units, which are emerging and readily available in recent graphics processing units (GPUs). Our evaluation shows that HOMS-TC is 31× faster on average than alternative search engines and provides comparable accuracy to competing search tools. Availability and implementation HOMS-TC is freely available under the Apache 2.0 license as an open-source software project at https://github.com/tycheyoung/homs-tc.
  5. Kelso, Janet (Ed.)
    Abstract Motivation Native top-down proteomics (nTDP) integrates native mass spectrometry (nMS) with top-down proteomics (TDP) to provide comprehensive analysis of protein complexes together with proteoform identification and characterization. Despite significant advances in nMS and TDP software developments, a unified and user-friendly software package for analysis of nTDP data remains lacking. Results We have developed MASH Native to provide a unified solution for nTDP to process complex datasets with database searching capabilities in a user-friendly interface. MASH Native supports various data formats and incorporates multiple options for deconvolution, database searching, and spectral summing to provide a “one-stop shop” for characterizing both native protein complexes and proteoforms. Availability and implementation The MASH Native app, video tutorials, written tutorials, and additional documentation are freely available for download at https://labs.wisc.edu/gelab/MASH_Explorer/MASHSoftware.php. All data files shown in user tutorials are included with the MASH Native software in the download .zip file.
  6. Kelso, Janet (Ed.)
    Abstract Motivation Current methods for genotype imputation and phasing exploit the volume of data in haplotype reference panels and rely on hidden Markov models (HMMs). Existing programs all have essentially the same imputation accuracy, are computationally intensive and generally require prephasing the typed markers. Results We introduce a novel data-mining method for genotype imputation and phasing that substitutes highly efficient linear algebra routines for HMM calculations. This strategy, embodied in our Julia program MendelImpute.jl, avoids explicit assumptions about recombination and population structure while delivering similar prediction accuracy, better memory usage and an order of magnitude or better run-times compared to the fastest competing method. MendelImpute operates on both dosage data and unphased genotype data and simultaneously imputes missing genotypes and phase at both the typed and untyped SNPs (single nucleotide polymorphisms). Finally, MendelImpute naturally extends to global and local ancestry estimation and lends itself to new strategies for data compression and hence faster data transport and sharing. Availability and implementation Software, documentation and scripts to reproduce our results are available from https://github.com/OpenMendel/MendelImpute.jl. Supplementary information Supplementary data are available at Bioinformatics online. 
  7. Kelso, Janet (Ed.)
    Abstract Motivation Sequence alignment is one of the first steps in many modern genomic analyses, such as variant detection, transcript abundance estimation and metagenomic profiling. Unfortunately, it is often a computationally expensive procedure. As the quantity of data and wealth of different assays and applications continue to grow, the need for accurate and fast alignment tools that scale to large collections of reference sequences persists. Results In this article, we introduce PuffAligner, a fast, accurate and versatile aligner built on top of the Pufferfish index. PuffAligner is able to produce highly sensitive alignments, similar to those of Bowtie2, but much more quickly. While exhibiting similar speed to the ultrafast STAR aligner, PuffAligner requires considerably less memory to construct its index and align reads. PuffAligner strikes a desirable balance with respect to the time, space and accuracy tradeoffs made by different alignment tools and provides a promising foundation on which to test new alignment ideas over large collections of sequences. Availability and implementation All the data used for preparing the results of this paper can be found at DOI 10.5281/zenodo.4902332. PuffAligner is free and open-source software. It is implemented in C++14 and can be obtained from https://github.com/COMBINE-lab/pufferfish/tree/cigar-strings. Supplementary information Supplementary data are available at Bioinformatics online.
  8. Kelso, Janet (Ed.)
    Abstract Motivation Genetic or epigenetic events can rewire molecular networks to induce extraordinary phenotypical divergences. Among the many network rewiring approaches, no model-free statistical methods can differentiate gene-gene pattern changes not attributed to marginal changes. This may obscure fundamental rewiring from superficial changes. Results Here we introduce a model-free Sharma-Song test to determine if patterns differ in the second order, meaning that the deviation of the joint distribution from the product of marginal distributions is unequal across conditions. We prove an asymptotic chi-squared null distribution for the test statistic. Simulation studies demonstrate its advantage over alternative methods in detecting second-order differential patterns. Applying the test to three independent mammalian developmental transcriptome datasets, we report a lower frequency of co-expression network rewiring between human and mouse for the same tissue group than the frequency of rewiring between tissue groups within the same species. We also find second-order differential patterns between microRNA promoters and genes contrasting cerebellum and liver development in mice. These patterns are enriched in the spliceosome pathway regulating tissue specificity. Complementary to previous mammalian comparative studies mostly driven by first-order effects, our findings contribute an understanding of system-wide second-order gene network rewiring within and across mammalian systems. Second-order differential patterns constitute evidence for fundamentally rewired biological circuitry due to evolution, environment, or disease. Availability The generic Sharma-Song test is available from the R package ‘DiffXTables’ at https://cran.r-project.org/package=DiffXTables. Other code and data are described in Methods. Supplementary information Supplementary data are available at Bioinformatics online.
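
The second-order notion in item 8 (the deviation of the joint distribution from the product of the marginals, compared across conditions) can be illustrated with a small sketch. This is not the Sharma-Song test and does not reproduce its statistic or its chi-squared null distribution; it only computes the quantity the abstract describes for each contingency table and reports how far the two conditions differ. The function names and the example counts are hypothetical.

```python
from typing import List

Table = List[List[float]]


def interaction_deviation(counts: Table) -> Table:
    """Return joint probabilities minus the product of the marginals, cell by cell."""
    total = sum(sum(row) for row in counts)
    if total <= 0:
        raise ValueError("table must contain at least one observation")
    n_cols = len(counts[0])
    row_marg = [sum(row) / total for row in counts]
    col_marg = [sum(row[j] for row in counts) / total for j in range(n_cols)]
    return [
        [counts[i][j] / total - row_marg[i] * col_marg[j] for j in range(n_cols)]
        for i in range(len(counts))
    ]


def max_deviation_gap(a: Table, b: Table) -> float:
    """Largest absolute cell-wise difference between the two deviation tables."""
    dev_a, dev_b = interaction_deviation(a), interaction_deviation(b)
    return max(abs(x - y) for ra, rb in zip(dev_a, dev_b) for x, y in zip(ra, rb))


# Hypothetical 2x2 count tables for two genes under different conditions.
reversed_a = [[40, 10], [10, 40]]    # positive association
reversed_b = [[10, 40], [40, 10]]    # same marginals, association reversed
independent = [[25, 25], [25, 25]]   # no association
shifted = [[10, 40], [10, 40]]       # marginals changed, still no association

print(max_deviation_gap(reversed_a, reversed_b))   # second-order change
print(max_deviation_gap(independent, shifted))     # first-order change only
```

In this sketch the first pair differs only in how the two variables interact, so the gap is large, while the second pair differs only in its marginals, so the gap is essentially zero. A formal test, such as the one in the DiffXTables package, is still needed to judge whether an observed gap is statistically significant.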